================================================================================================================

Introduction

The aim of this project was to analyze wine data for the Portuguese “Vinho Verde” wine. The data set includes 11 variables describing the chemical composition of the wines and an output variable, quality, which is obtained by averaging the ratings of at least 3 wine experts. The analysis attempts to determine the relationship between the chemical content of a wine and its quality rating.

Overview of the Data

Size of the Wine Data Sets

The white wine data set has 4898 entries. The dimensions of the data set used for this report are shown below.
## [1] 4898   18
The red wine data set has 1599 rows with the same variables as the white wine data set.
## [1] 1599   18
NOTE: The 18 columns include a row index (X), a type variable, and 4 additional variables/features created for better representation of the data. These will be discussed in the analysis section.

Overview of the Variables

The names of the variables and a brief description of their data types are given below for the two data sets. An additional variable (type) was defined in both data sets to differentiate between the types of wine. This variable is useful for combined and comparative analysis of the two data sets.

White wine overview

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "type"                 "combined.acidity"    
## [16] "sugar.acid.ratio"     "taste"                "taste.due.to.pH"
## 'data.frame':    4898 obs. of  18 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ type                : Ord.factor w/ 1 level "White": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined.acidity    : num  7.63 6.94 8.78 7.75 7.75 8.78 6.68 7.63 6.94 8.75 ...
##  $ sugar.acid.ratio    : num  2.713 0.231 0.786 1.097 1.097 ...
##  $ taste               : Ord.factor w/ 4 levels "Dry"<"Medium_Dry"<..: 3 1 1 2 2 1 2 3 1 1 ...
##  $ taste.due.to.pH     : Ord.factor w/ 4 levels "Dry"<"Medium_Dry"<..: 3 2 1 2 2 1 2 3 2 1 ...

Red wine overview

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "type"                 "combined.acidity"    
## [16] "sugar.acid.ratio"     "taste"                "taste.due.to.pH"
## 'data.frame':    1599 obs. of  18 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : Ord.factor w/ 1 level "Red": 1 1 1 1 1 1 1 1 1 1 ...
##  $ combined.acidity    : num  8.1 8.68 8.6 12.04 8.1 ...
##  $ sugar.acid.ratio    : num  0.235 0.3 0.267 0.158 0.235 ...
##  $ taste               : Ord.factor w/ 2 levels "Dry"<"Medium_Dry": 1 1 1 1 1 1 1 1 1 1 ...
##  $ taste.due.to.pH     : Ord.factor w/ 3 levels "Dry"<"Medium_Dry"<..: 3 1 1 1 3 3 2 2 2 2 ...

Overview of the Combined Data Set

Combining the two data sets can help reveal interesting insights into the chemical composition of wines and the resulting quality variable.
## [1] 6497   18
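A minimal sketch of how two such data sets could be tagged and stacked is shown here. The names wqw and wqr appear in the correlation output of this report; the name wq, the toy values and the use of a plain (unordered) factor are assumptions made only for illustration.

```r
# Toy stand-ins for the white and red data frames (a few values copied from
# the structure output above; the real sets have 4898 and 1599 rows).
wqw <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))
wqr <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L))

# Tag each set with its wine type before stacking.
wqw$type <- "White"
wqr$type <- "Red"

# rbind() stacks data frames that share the same column names.
# 'wq' is a hypothetical name for the combined set.
wq <- rbind(wqw, wqr)
wq$type <- factor(wq$type, levels = c("White", "Red"))

dim(wq)         # rows = nrow(wqw) + nrow(wqr); columns unchanged
table(wq$type)  # sample count per wine type
```

With the full data, the same steps would give the 6497 rows reported above (4898 + 1599).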

Quality of the Wine Samples (the only categorical variable in the original data)

The quality of each wine sample was rated by at least 3 wine experts on a scale from 0 (very bad) to 10 (excellent). In these particular data sets, wine quality ranges between 3 and 9.
## [1] 3 4 5 6 7 8 9

Summary of the Data Set

White Wine

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality         type      combined.acidity
##  Min.   : 8.00   Min.   :3.000   White:4898   Min.   : 4.130  
##  1st Qu.: 9.50   1st Qu.:5.000                1st Qu.: 6.890  
##  Median :10.40   Median :6.000                Median : 7.405  
##  Mean   :10.51   Mean   :5.878                Mean   : 7.467  
##  3rd Qu.:11.40   3rd Qu.:6.000                3rd Qu.: 7.960  
##  Max.   :14.20   Max.   :9.000                Max.   :14.960  
##  sugar.acid.ratio           taste          taste.due.to.pH
##  Min.   :0.06459   Dry         :3053   Dry         :2286  
##  1st Qu.:0.23495   Medium_Dry  :1591   Medium_Dry  :1985  
##  Median :0.72251   Medium_Sweet: 253   Medium_Sweet: 586  
##  Mean   :0.85776   Sweet       :   1   Sweet       :  41  
##  3rd Qu.:1.28738                                          
##  Max.   :7.02616
Summary: The majority of the white wine samples in this data set have a quality of around 6, with the mean at 5.9. On average, the samples were therefore rated slightly above the midpoint of the 0 to 10 scale. The alcohol percentage for most of the samples is around 10-11%. The mean value of total SO2 is around 138, which might cause a slight smell in the nose and taste of the wine. The pH of most of the samples is close to 3.2, which is on the acidic side of the pH scale. The additional variables combined.acidity and sugar.acid.ratio were created and then used to build categorical variables that help determine the taste of the wine samples. Most white wine samples fall under the dry to medium dry categories, and only about 5% are on the sweeter side.

Red Wine

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality       type      combined.acidity
##  Min.   : 8.40   Min.   :3.000   Red:1599   Min.   : 5.270  
##  1st Qu.: 9.50   1st Qu.:5.000              1st Qu.: 7.827  
##  Median :10.20   Median :6.000              Median : 8.720  
##  Mean   :10.42   Mean   :5.636              Mean   : 9.118  
##  3rd Qu.:11.10   3rd Qu.:6.000              3rd Qu.:10.070  
##  Max.   :14.90   Max.   :8.000              Max.   :17.045  
##  sugar.acid.ratio        taste          taste.due.to.pH
##  Min.   :0.1053   Dry       :1580   Dry         :717   
##  1st Qu.:0.2117   Medium_Dry:  19   Medium_Dry  :691   
##  Median :0.2482                     Medium_Sweet:191   
##  Mean   :0.2854                                        
##  3rd Qu.:0.3008                                        
##  Max.   :2.0807
Summary: Red wine samples in this data set have a median quality of 6, with the mean a little lower at 5.6. As with white wine, the samples were therefore rated, on average, slightly above the midpoint of the scale. The alcohol percentage for most of the samples is around 10-10.5%, similar to that of the white wine samples. The mean value of total SO2 is around 46.5, which is much lower than what we observed for white wine. The pH of most of the samples is close to 3.3. The additional variables created for taste show that red wine samples are mostly on the dry side.

Univariate Plots Section

This section explores the variables in the data set in the form of univariate charts and plots.
NOTE: For histograms, white bars are used for white wine data and red bars for red wine data.

Distribution of Alcohol content of wine samples

The alcohol percentage distribution of the wine samples is multimodal, with a major peak at around 9.5% for both types of wine and a smaller peak at 11%. The log10-transformed plot of alcohol shows a similar distribution. The frequency polygon shows that for most of the samples the alcohol percentage is between 9% and 13%, with multiple peaks.

pH variation in the dataset

The pH values of the samples appear roughly symmetrically distributed, with a mean of around 3.2 for white and 3.3 for red wine. The log transformation shows a similar shape. Analysis together with other variables in the bivariate and multivariate sections might reveal more interesting relationships between pH and the other variables in the dataset.

Sulphate (SO4) distribution of the samples

The distribution of sulphates is right-skewed for both wine types (most values sit at the low end, with a long tail towards higher values), while the log-transformed plot is closer to symmetric.
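As a rough illustration of how a log10 transformation changes the shape of a distribution like this one, the sketch below uses a small made-up sample (not taken from the wine data) and a hypothetical helper, skewness(), based on the usual moment formula. It is only meant to show the idea, not to reproduce the plots in this report.

```r
# Hypothetical sample with most values at the low end and a long upper tail.
x <- c(0.4, 0.45, 0.5, 0.5, 0.55, 0.6, 0.7, 0.9, 1.3, 2.0)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skewness(x)          # positive for this sample
skewness(log10(x))   # smaller, because the log compresses the upper tail
mean(x) > median(x)  # a quick sign of a long right tail
```

The same check (mean noticeably above the median) can be read directly off the summary tables earlier in the report.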

Distribution of Sulphur Dioxide (SO2)

Free Sulphur Dioxide

The distribution of free sulphur dioxide for both types of wine is heavily right-skewed (concentrated at low values with a long upper tail), showing that free SO2 is generally less than 100 mg/dm3. The red wine distribution, which is more skewed than that of white wine, shows slight bimodality in the transformed plot.

Total Sulphur Dioxide

Total sulphur dioxide, a combination of free and bound SO2, is mostly undetectable in wine. The red wine samples are heavily right-skewed, while the white wine samples show a bell-shaped distribution. The transformed plot for the red wine samples shows a wide spread with multimodality.

Distribution of Acidic content in the Wine samples

Total Acidic Content

The total acidic content of the two wine types is clearly different. For white wine, all the acid types follow an approximately normal distribution, while for red wine fixed.acidity shows slight multimodality and volatile.acidity and citric.acid show clearer multimodality.
## [1] 132
There are 132 samples of red wine which have no citric acid.
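A count like this can be obtained by summing a logical comparison. The sketch below uses only the first ten red wine citric.acid values shown in the structure output above, so it returns the count for those ten rows rather than the 132 reported for the full data set.

```r
# First ten citric.acid values of the red wine data (from the str() output above).
wqr <- data.frame(citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36))

# TRUE counts as 1 when summed, so this counts the samples with no citric acid.
n_zero <- sum(wqr$citric.acid == 0)
n_zero
```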

Combined Acidity

The combined acidity plots show a slightly right-skewed distribution for red wine samples, while white wine samples follow an approximately normal distribution. This created feature helps determine the taste of the wine samples.

Sugar content in the Wine samples

The histogram for residual sugar shows that most of the samples have a sugar level between 1 and 3 g/dm3. The transformed plot for white wine samples also shows multimodality, with peaks at around 0.2, 0.9 and 1.2 on the log10 scale (roughly 1.6, 7.9 and 15.8 g/dm3). The red wine samples appear more evenly distributed, with slight right skewness.

The detailed histogram for residual sugar shows that most of the samples have a sugar level between 1 and 3 g/dm3. The plot also shows the multimodality of the white wine samples in more detail.

Sugar to acid ratio of Wine samples

The sugar to acid ratio plots are heavily right-skewed. The transformed plots show bimodality with two significant peaks for white wine samples, while red wine samples follow an approximately normal distribution.

Taste of Wine samples

The overall taste distribution (variable based on the information at this url: http://drinkriesling.com/tastescale/thescale) is on the dry side. Summary for taste variation in both the data sets is shown below.

Taste of White Wine

##          Dry   Medium_Dry Medium_Sweet        Sweet 
##         3053         1591          253            1

Taste of Red Wine

##        Dry Medium_Dry 
##       1580         19
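One way a taste variable of this kind could be derived from sugar.acid.ratio is with cut(). The break points 1, 2 and 4, and the choice of left-closed intervals, are assumptions for this sketch: they are not stated in this report, although they are consistent with the first four white wine rows in the structure output above (ratios 2.713, 0.231, 0.786, 1.097 with taste codes 3, 1, 1, 2).

```r
# Sugar to acid ratios of the first four white wine samples (from the str() output).
ratio <- c(2.713, 0.231, 0.786, 1.097)

# Assumed break points; intervals are [0,1), [1,2), [2,4), [4,Inf).
taste <- cut(ratio,
             breaks = c(0, 1, 2, 4, Inf),
             labels = c("Dry", "Medium_Dry", "Medium_Sweet", "Sweet"),
             right = FALSE,
             ordered_result = TRUE)

as.integer(taste)  # level codes, comparable to those in the str() output
table(taste)       # counts per category, as in the tables above
```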

Taste of Wine samples due to pH

The taste distribution after adjusting for pH (variable based on the information at this url: http://drinkriesling.com/tastescale/thescale) shifts up a bucket towards Medium Dry. A summary of the taste variation due to pH in both data sets is shown below.

Taste of White Wine due to pH

##          Dry   Medium_Dry Medium_Sweet        Sweet 
##         2286         1985          586           41

Taste of Red Wine due to pH

##          Dry   Medium_Dry Medium_Sweet 
##          717          691          191

Saltiness in Wine samples

The plots show that most samples of both types lie between 0 and 0.1 g/dm3, with a few outliers having a salt content of around 0.4 g/dm3. The transformed plots show a peak at -1.25 for white wine samples and at -1.2 for red wine samples.
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 485   485           6.2            0.370        0.30            6.6
## 1218 1218           8.0            0.610        0.38           12.1
## 4916   18           8.1            0.560        0.28            1.7
## 4918   20           7.9            0.320        0.51            1.8
## 4941   43           7.5            0.490        0.20            2.6
## 4980   82           7.8            0.430        0.70            1.9
## 4982   84           7.3            0.670        0.26            1.8
## 5005  107           7.8            0.410        0.68            1.7
## 5050  152           9.2            0.520        1.00            3.4
## 5068  170           7.5            0.705        0.24            1.8
## 5125  227           8.9            0.590        0.50            2.0
## 5157  259           7.7            0.410        0.76            1.8
## 5180  282           7.7            0.270        0.68            3.5
## 5190  292          11.0            0.200        0.48            2.0
## 5350  452           8.4            0.370        0.53            1.8
## 5591  693           8.6            0.490        0.51            2.0
## 5629  731           9.5            0.550        0.66            2.3
## 5653  755           7.8            0.480        0.68            1.7
## 5950 1052           8.5            0.460        0.59            1.4
## 6064 1166           8.5            0.440        0.50            1.9
## 6159 1261           8.6            0.635        0.68            1.8
## 6218 1320           9.1            0.760        0.68            1.7
## 6269 1371           8.7            0.780        0.51            1.7
## 6271 1373           8.7            0.780        0.51            1.7
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 485      0.346                  79                  200 0.99540 3.29
## 1218     0.301                  24                  220 0.99930 2.94
## 4916     0.368                  16                   56 0.99680 3.11
## 4918     0.341                  17                   56 0.99690 3.04
## 4941     0.332                   8                   14 0.99680 3.21
## 4980     0.464                  22                   67 0.99740 3.13
## 4982     0.401                  16                   51 0.99690 3.16
## 5005     0.467                  18                   69 0.99730 3.08
## 5050     0.610                  32                   69 0.99960 2.74
## 5068     0.360                  15                   63 0.99640 3.00
## 5125     0.337                  27                   81 0.99640 3.04
## 5157     0.611                   8                   45 0.99680 3.06
## 5180     0.358                   5                   10 0.99720 3.25
## 5190     0.343                   6                   18 0.99790 3.30
## 5350     0.413                   9                   26 0.99790 3.06
## 5591     0.422                  16                   62 0.99790 3.03
## 5629     0.387                  12                   37 0.99820 3.17
## 5653     0.415                  14                   32 0.99656 3.09
## 5950     0.414                  16                   45 0.99702 3.03
## 6064     0.369                  15                   38 0.99634 3.01
## 6159     0.403                  19                   56 0.99632 3.02
## 6218     0.414                  18                   64 0.99652 2.90
## 6269     0.415                  12                   66 0.99623 3.00
## 6271     0.415                  12                   66 0.99623 3.00
##      sulphates alcohol quality  type combined.acidity sugar.acid.ratio
## 485       0.58     9.6       5 White            6.870        0.9606987
## 1218      0.48     9.2       5 White            8.990        1.3459399
## 4916      1.28     9.3       5   Red            8.940        0.1901566
## 4918      1.08     9.2       6   Red            8.730        0.2061856
## 4941      0.90    10.5       6   Red            8.190        0.3174603
## 4980      1.28     9.4       5   Red            8.930        0.2127660
## 4982      1.14     9.4       5   Red            8.230        0.2187120
## 5005      1.31     9.3       5   Red            8.890        0.1912261
## 5050      2.00     9.4       4   Red           10.720        0.3171642
## 5068      1.59     9.5       5   Red            8.445        0.2131439
## 5125      1.61     9.5       6   Red            9.990        0.2002002
## 5157      1.26     9.4       5   Red            8.870        0.2029312
## 5180      1.08     9.9       7   Red            8.650        0.4046243
## 5190      0.71    10.5       5   Red           11.680        0.1712329
## 5350      1.06     9.1       6   Red            9.300        0.1935484
## 5591      1.17     9.0       5   Red            9.600        0.2083333
## 5629      0.67     9.6       5   Red           10.710        0.2147526
## 5653      1.06     9.1       6   Red            8.960        0.1897321
## 5950      1.34     9.2       5   Red            9.550        0.1465969
## 6064      1.10     9.4       5   Red            9.440        0.2012712
## 6159      1.15     9.3       5   Red            9.915        0.1815431
## 6218      1.33     9.1       6   Red           10.540        0.1612903
## 6269      1.17     9.2       5   Red            9.990        0.1701702
## 6271      1.17     9.2       5   Red            9.990        0.1701702
##           taste taste.due.to.pH
## 485         Dry             Dry
## 1218 Medium_Dry      Medium_Dry
## 4916        Dry             Dry
## 4918        Dry             Dry
## 4941        Dry             Dry
## 4980        Dry             Dry
## 4982        Dry             Dry
## 5005        Dry             Dry
## 5050        Dry             Dry
## 5068        Dry             Dry
## 5125        Dry             Dry
## 5157        Dry             Dry
## 5180        Dry             Dry
## 5190        Dry      Medium_Dry
## 5350        Dry             Dry
## 5591        Dry             Dry
## 5629        Dry             Dry
## 5653        Dry             Dry
## 5950        Dry             Dry
## 6064        Dry             Dry
## 6159        Dry             Dry
## 6218        Dry             Dry
## 6269        Dry             Dry
## 6271        Dry             Dry
The table above is a subset of the data set showing the outliers for salt content, i.e. samples with chloride values greater than 0.3 g/dm3.
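A subset of this kind can be produced with subset(). The sketch below is only illustrative: wq is a hypothetical name for the combined data frame, and the five chloride values are copied from rows shown in this report.

```r
# A few chloride values from the report (white rows 1, 485, 1218; red rows 1, 18).
wq <- data.frame(
  chlorides = c(0.045, 0.346, 0.301, 0.076, 0.368),
  type      = c("White", "White", "White", "Red", "Red")
)

# Keep only the samples whose salt content exceeds 0.3 g/dm3.
salty <- subset(wq, chlorides > 0.3)
salty
nrow(salty)
```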

The detailed plot shows an approximately normal shape in the regular histogram. The major peak is at 0.045 g/dm3 for white wine and at 0.085 g/dm3 for red wine.

Density of the Wine samples

Density for both wine types lies mostly between 0.99 and 1.00 g/cm3, with both distributions close to normal.

Quality of Wine samples

Quality, the only categorical variable and the output variable, has a value of 6 for most of the white wine samples, while 5 is the rating given to most of the red wine samples.

Univariate Analysis

What is the structure of your dataset?

The dataset consists of 6497 observations, of which 4898 are for white wine and the remaining 1599 for red wine. The original data set has 12 variables, one of which is the output variable (quality). Quality is based on sensory data provided by at least 3 wine experts and is scored between 0 (poor) and 10 (excellent).
Useful observations at a glance:
- The median quality rating is 6 for both wine types, with mean ratings of about 5.9 for white and 5.6 for red wine
- The mean alcohol content in both type of wines is around 10.5%
- Mean total SO2 is 138.4 for white and 46.47 for red wine, while the maximum values are 440 and 289 respectively, suggesting that the maxima are outliers.
- Similar statistics can be observed for the free SO2 content, as the median values are far below the maximum values.
- Residual sugar for both wine types also shows anomalies, with maximum values far above the 3rd quartile values.
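The per-type mean quality values from the summaries can be pooled into a single overall mean by weighting each by its sample count. The sketch below does this with weighted.mean(); the variable names are hypothetical, and the inputs are the rounded means printed in the summary tables, so the result is approximate.

```r
# Sample counts and mean quality per wine type, taken from the summaries above.
n            <- c(White = 4898, Red = 1599)
mean_quality <- c(White = 5.878, Red = 5.636)

# Pooled mean = sum(n * mean) / sum(n).
pooled <- weighted.mean(mean_quality, n)
round(pooled, 2)
```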

What is/are the main feature(s) of interest in your dataset?

In my opinion, the main features of the wine data sets are quality and alcohol. Other features that might play an important role in determining the quality of a wine sample are residual sugar and the acid contents (fixed, volatile and citric).
Correlation results for Quality and Alcohol are shown below:

White Wine

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$alcohol and wqw$quality
## t = 33.8585, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

Red Wine

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol and wqr$quality
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
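The t statistics in the two outputs above follow from the correlation estimate and the degrees of freedom through t = r * sqrt(df) / sqrt(1 - r^2). The short sketch below recomputes them from the printed values; t_from_r is a hypothetical helper name introduced here.

```r
# t statistic of a Pearson correlation test, given r and df = n - 2.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_white <- t_from_r(0.4355747, 4896)  # compare with t = 33.8585 above
t_red   <- t_from_r(0.4761663, 1597)  # compare with t = 21.6395 above

round(c(white = t_white, red = t_red), 4)
```

Both correlations are similar in size; the larger t value for white wine mainly reflects the larger sample.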

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Additional features that might be effective in determining the quality of wine are pH, density and sulphate content (which acts as an antioxidant).

Did you create any new variables from existing variables in the dataset?

The following new variables were created using the existing variables:
- All the acidity variables were combined to form ‘combined.acidity’, i.e. the sum of fixed.acidity, volatile.acidity and citric.acid. This new variable was used as an input for another new variable.
combined.acidity = fixed.acidity + volatile.acidity + citric.acid
- A variable for the ratio of sugar to acid (sugar.acid.ratio) in the wine samples was also created to measure the proportion of dryness and sweetness.
sugar.acid.ratio = residual.sugar / combined.acidity
- A variable for ‘type’ was created when combining the White and Red wine data sets to distinguish between the two.
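The two formulas above can be checked against the first three white wine rows from the structure output. This is a minimal sketch using only those rows, not the code used to build the report.

```r
# First three white wine samples (values from the str() output above).
wqw <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28),
  citric.acid      = c(0.36, 0.34, 0.4),
  residual.sugar   = c(20.7, 1.6, 6.9)
)

# Derived features, following the formulas given in the list above.
wqw$combined.acidity <- with(wqw, fixed.acidity + volatile.acidity + citric.acid)
wqw$sugar.acid.ratio <- wqw$residual.sugar / wqw$combined.acidity

round(wqw$combined.acidity, 2)  # 7.63 6.94 8.78
round(wqw$sugar.acid.ratio, 3)  # 2.713 0.231 0.786
```

The results match the combined.acidity and sugar.acid.ratio values printed for these rows in the white wine overview.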

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The histogram for alcohol, when first created, showed multimodality and was hard to interpret. A frequency polygon was created so that the data looks tidier and is easier to interpret. Since alcohol is an important feature in the data set, it is useful that the frequency polygon shows the variations clearly.
Distributions for residual sugar, free sulphur dioxide and sugar to acid ratio were transformed to view modality. The histograms for all of these were right-skewed when initially plotted, and plotting the log10 transformation revealed multimodality. The plot of residual sugar for white wine samples shows peaks at around 0.2, 0.9 and 1.2. The transformed plot of free sulphur dioxide for red wine samples shows multimodality with no significant peak. The sugar to acid ratio histogram for white wine showed a large number of samples with a ratio less than 0.25 and, when transformed, showed multimodality with peaks at around -0.7, 0.1 and 0.3.
For ease of viewing, some of the plots were adjusted by tweaking the scales. The histograms for chlorides and residual sugar were adjusted by changing the scales and binwidth to view the distributions in more detail. The residual sugar content in white wine gave an insight into the widespread distribution of the samples, which was ultimately confirmed by the transformed plot. Adjusting the chloride plot gave a more finely binned version which was easier to interpret.

Bivariate Plots Section

We will start off by tabulating and plotting the original variables from the wine data sets to look for interesting relationships.

Correlation between Wine data set variables

Correlation table of White Wine

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.02        0.29
## volatile.acidity             -0.02             1.00       -0.15
## citric.acid                   0.29            -0.15        1.00
## residual.sugar                0.09             0.06        0.09
## chlorides                     0.02             0.07        0.11
## free.sulfur.dioxide          -0.05            -0.10        0.09
## total.sulfur.dioxide          0.09             0.09        0.12
## density                       0.27             0.03        0.15
## pH                           -0.43            -0.03       -0.16
## sulphates                    -0.02            -0.04        0.06
## alcohol                      -0.12             0.07       -0.08
## quality                      -0.11            -0.19       -0.01
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.09      0.02               -0.05
## volatile.acidity               0.06      0.07               -0.10
## citric.acid                    0.09      0.11                0.09
## residual.sugar                 1.00      0.09                0.30
## chlorides                      0.09      1.00                0.10
## free.sulfur.dioxide            0.30      0.10                1.00
## total.sulfur.dioxide           0.40      0.20                0.62
## density                        0.84      0.26                0.29
## pH                            -0.19     -0.09                0.00
## sulphates                     -0.03      0.02                0.06
## alcohol                       -0.45     -0.36               -0.25
## quality                       -0.10     -0.21                0.01
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                        0.09    0.27 -0.43     -0.02   -0.12
## volatile.acidity                     0.09    0.03 -0.03     -0.04    0.07
## citric.acid                          0.12    0.15 -0.16      0.06   -0.08
## residual.sugar                       0.40    0.84 -0.19     -0.03   -0.45
## chlorides                            0.20    0.26 -0.09      0.02   -0.36
## free.sulfur.dioxide                  0.62    0.29  0.00      0.06   -0.25
## total.sulfur.dioxide                 1.00    0.53  0.00      0.13   -0.45
## density                              0.53    1.00 -0.09      0.07   -0.78
## pH                                   0.00   -0.09  1.00      0.16    0.12
## sulphates                            0.13    0.07  0.16      1.00   -0.02
## alcohol                             -0.45   -0.78  0.12     -0.02    1.00
## quality                             -0.17   -0.31  0.10      0.05    0.44
##                      quality
## fixed.acidity          -0.11
## volatile.acidity       -0.19
## citric.acid            -0.01
## residual.sugar         -0.10
## chlorides              -0.21
## free.sulfur.dioxide     0.01
## total.sulfur.dioxide   -0.17
## density                -0.31
## pH                      0.10
## sulphates               0.05
## alcohol                 0.44
## quality                 1.00

Correlation plot of White Wine

Summary: For white wine, density, SO2, residual sugar, alcohol, citric acid and fixed acidity show correlations with each other. Although weak (0.26), density also shows some correlation with chlorides. The main features of our data set, alcohol and quality, have a moderate correlation of 0.44. Density and residual sugar have the highest correlation, 0.84. These relationships will be explored in later sections.
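A rounded correlation table like the one above can be produced with cor() and round(). The sketch below uses only the first five white wine rows from the structure output, so its values differ from the full-data table; it only illustrates the call pattern, and the name m is hypothetical.

```r
# First five white wine samples for three of the variables (from the str() output).
wqw <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pairwise Pearson correlations, rounded to two decimals as in the table above.
m <- round(cor(wqw), 2)
m
```

Even on these few rows the signs agree with the full table: residual sugar and density move together, while density and alcohol move in opposite directions.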

Correlation table of Red Wine

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

Correlation plot of Red Wine

Summary: For red wine, density, SO2, sugar, chlorides, citric acid and fixed acidity show correlations with one another. Alcohol and quality have a moderate correlation of 0.48. Density and fixed.acidity are among the most strongly correlated pairs (r = 0.67), level with fixed.acidity and citric.acid and with free and total sulfur dioxide, while fixed.acidity and pH have the strongest negative correlation (r = -0.68). These relationships are explored further in the sections below.
Now let's explore the relationships in greater detail, beginning our analysis with the main variables of interest.

Relation between Quality and Alcohol

Revisiting Alcohol distribution for both types of wines

Scatter Plot for White wine (Quality vs. Alcohol)

Table for Quality vs. Alcohol of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90
Wine samples with a higher alcohol percentage tend to receive higher quality ratings from the wine experts. The relationship between alcohol content and quality appears roughly linear. The summary table shows that mean alcohol content rises steadily with quality from a rating of 5 upward.
NOTE: The jitter feature of ggplot was not used in this section because it spread out the concentration of points and made it difficult for me to find a trend. Jitter is used extensively in the multivariate section.
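
The per-rating summaries above have the layout that by() prints, and header lines such as "wqw$quality: 3" suggest a call of the form by(wqw$alcohol, wqw$quality, summary), although the original code is not shown. A minimal sketch of the idea, using a small invented data frame `toy` in place of the wine data:

```r
# Synthetic stand-in for the white wine data frame (values are invented).
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.3, 9.6, 10.5, 11.4, 10.6, 11.4, 12.3)
)

# Six-number summary of alcohol within each quality rating.
alcohol_by_quality <- by(toy$alcohol, toy$quality, summary)
print(alcohol_by_quality)

# Group means alone, indexed by quality rating.
mean_alcohol <- tapply(toy$alcohol, toy$quality, mean)
print(round(mean_alcohol, 2))
```

tapply() is handy when only one statistic per group is needed, for example to plot mean alcohol against quality.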

Box Plot for White wine (Quality vs. Alcohol)

The box plot reveals another angle on the relationship between quality and alcohol. Both the mean and median alcohol levels drop from a quality rating of 3 to 5. For ratings above 5, the mean and median alcohol levels increase almost linearly.
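
The drop and subsequent rise can be checked directly from the group means. The sketch below copies the mean alcohol values from the white wine summary table above; the variable names are my own.

```r
# Mean alcohol per quality rating, copied from the white wine summary table.
mean_alcohol <- c(`3` = 10.34, `4` = 10.15, `5` = 9.809, `6` = 10.58,
                  `7` = 11.37, `8` = 11.64, `9` = 12.18)

# Change in mean alcohol from each rating to the next one.
step_change <- diff(mean_alcohol)
print(round(step_change, 2))

# Rating with the lowest mean alcohol.
lowest <- names(which.min(mean_alcohol))
print(lowest)
```

Negative steps up to rating 5 followed by positive steps afterwards correspond to the dip seen in the box plot.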

Scatter Plot for Red wine (Quality vs. Alcohol)

Table for Quality vs. Alcohol of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
The red wine data set has fewer observations but shows a clear relationship between alcohol and quality. Generally, the higher the alcohol content, the better the quality rating. The summary table above is consistent with the scatter plot: apart from a dip at rating 5, mean alcohol content increases with quality.

Box Plot for Red wine (Quality vs. Alcohol)

Relationship between Quality and Taste for White wine

Plot for Quality vs. Taste of White wine

The plots suggest that most of the higher-quality samples fall under the Dry to Medium Dry taste categories. Even though the ranges shift when taste is derived from pH, the major contribution to quality comes from samples on the dry side. The summary table for the result is shown below.

Table for Quality vs. Taste of White wine

## wqw$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.958   7.000   9.000 
## -------------------------------------------------------- 
## wqw$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.775   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   6.000   5.553   6.000   8.000 
## -------------------------------------------------------- 
## wqw$taste: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6
NOTE: Category Sweet is an exception as it has only one instance.

Table for Quality vs. Taste (pH) of White wine

## wqw$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.886   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.907   6.000   9.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.787   6.000   8.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   5.000   5.000   5.341   6.000   6.000

Relationship between Quality and Taste for Red wine

Plot for Quality vs. Taste of Red wine

For red wine, the plots tell a similar story. Samples tend to taste drier than white wine, as is evident from the taste vs. quality histogram (there are no samples in the medium sweet and sweet ranges). Plotting taste due to pH again shows that the drier samples contribute more towards good quality. Quality for the different taste buckets is summarized below.

Table for Quality vs. Taste of Red wine

## wqr$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.637   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   5.000   6.000   5.526   6.000   6.000

Table for Quality vs. Taste (pH) of Red wine

## wqr$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.681   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.618   6.000   8.000 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.534   6.000   8.000

Alcohol relationship with pH

Distribution of pH for wine samples revisited

Scatter Plot for White wine (Alcohol vs. pH)

Correlation coefficient for White wine (Alcohol vs. pH)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$alcohol and wqw$pH
## t = 8.5601, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09374446 0.14893205
## sample estimates:
##       cor 
## 0.1214321
Although most of the white wine samples are concentrated in the pH range of 3.0-3.3, there is a slight positive correlation between alcohol and pH. The correlation test confirms this, with r = 0.12.
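
The output above is what cor.test prints for a Pearson test. As a minimal sketch of how the individual numbers can be pulled out of the result, the example below runs the test on invented data (the vectors `pH` and `alcohol` here are simulated, not taken from the wine data set):

```r
# Synthetic stand-in for two wine variables (values are invented).
set.seed(42)
n <- 500
pH      <- rnorm(n, mean = 3.2, sd = 0.15)
alcohol <- 10.5 + 1.0 * (pH - 3.2) + rnorm(n, sd = 1.2)

# Pearson correlation test; method = "pearson" is the default.
res <- cor.test(alcohol, pH, method = "pearson")

r  <- unname(res$estimate)  # sample correlation coefficient
ci <- res$conf.int          # 95 percent confidence interval for r
p  <- res$p.value           # p-value for H0: true correlation is 0
cat(sprintf("r = %.3f, 95%% CI [%.3f, %.3f], p = %.3g\n", r, ci[1], ci[2], p))
```

Note that the "cor" value in the printed output is the correlation coefficient r itself, not r^2.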

Scatter Plot for Red wine (Alcohol vs. pH)

Correlation coefficient for Red wine (Alcohol vs. pH)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$alcohol and wqr$pH
## t = 8.397, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1582061 0.2521123
## sample estimates:
##       cor 
## 0.2056325
Most of the points are concentrated between a pH of 3.2-3.5 and an alcohol percentage of 9-11. Apart from a few outliers, there seems to be a positive correlation between the two variables, which is confirmed by the correlation test (r = 0.21).

Variations in Alcohol with respect to Taste for White wine

Frequency Plot for Alcohol vs. Taste of White wine

Box Plot for Alcohol vs. Taste of White wine

The frequency plots show that dry wine samples have more alcohol, while sweeter samples have less. The box plots are consistent with the frequency plots and follow a roughly linear trend: lower alcohol content goes with a sweeter taste.

Table for Alcohol vs. Taste of White wine

## wqw$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   10.00   10.90   10.94   11.80   14.20 
## -------------------------------------------------------- 
## wqw$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.100   9.500   9.871  10.400  14.050 
## -------------------------------------------------------- 
## wqw$taste: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.500   8.800   9.100   9.407   9.600  13.000 
## -------------------------------------------------------- 
## wqw$taste: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    11.7    11.7    11.7    11.7    11.7    11.7
NOTE: Category ‘Sweet’ in taste vs. alcohol is an exception as it has only one instance.

Table for Alcohol vs. Taste (pH) of White wine

## wqw$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.90   10.80   10.89   11.89   14.20 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.30   10.00   10.26   11.00   14.05 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.100   9.800   9.981  10.500  14.000 
## -------------------------------------------------------- 
## wqw$taste.due.to.pH: Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.700   8.800   9.500   9.622  10.100  12.400

Variations in Alcohol with respect to Taste for Red wine

Frequency Plot for Alcohol vs. Taste of Red wine

Box Plot for Alcohol vs. Taste of Red wine

For red wine, the plot colored by taste does not show the effect of alcohol on taste clearly, as there are very few samples in the ‘Medium Dry’ range. Plotting alcohol colored by taste.due.to.pH reveals a roughly linear trend: as alcohol percentage increases, taste gets sweeter.

Table for Alcohol vs. Taste of Red wine

## wqr$taste: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.43   11.10   14.90 
## -------------------------------------------------------- 
## wqr$taste: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.800   9.200   9.900   9.984  10.400  12.200

Table for Alcohol vs. Taste (pH) of Red wine

## wqr$taste.due.to.pH: Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.29   11.00   14.90 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Dry
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.70    9.50   10.30   10.41   11.00   14.00 
## -------------------------------------------------------- 
## wqr$taste.due.to.pH: Medium_Sweet
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   9.233   9.850  10.800  10.960  11.700  14.000

Relationship between Quality and Sulphates

Sulphates revisited

Scatter Plot for White wine (Quality vs. Sulphates)

Box Plot for White wine (Quality vs. Sulphates)

Table for Quality vs. Sulphates of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2800  0.3800  0.4400  0.4745  0.5425  0.7400 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4700  0.4761  0.5400  0.8700 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2700  0.4200  0.4700  0.4822  0.5300  0.8800 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.4100  0.4800  0.4911  0.5500  1.0600 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4800  0.5031  0.5800  1.0800 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.3800  0.4600  0.4862  0.5850  0.9500 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.360   0.420   0.460   0.466   0.480   0.610
Sulphates show little correlation with the quality of white wine (r = 0.05). According to the summary table, the mean sulphate content increases very slightly from rating 3 to 7 (0.47 to 0.50) and then drops for ratings 8 and 9, as also seen in the box plot. The median sulphate content is almost the same for all quality buckets (0.44-0.48).

Scatter Plot for Red wine (Quality vs. Sulphates)

Box Plot for Red wine (Quality vs. Sulphates)

Table for Quality vs. Sulphates of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000
Red wine samples show a somewhat better correlation between sulphates and quality (r = 0.25). As is evident from the box plots and the scatter plot, higher sulphate content is on average associated with a better quality rating. The summary table agrees: both the mean and median sulphate content increase with quality.

Variations in Quality due to pH for White wine

Scatter Plot for White wine (Quality vs. pH)

Box Plot for White wine (Quality vs. pH)

Table for Quality vs. pH of White wine

## wqw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.870   3.035   3.215   3.188   3.325   3.550 
## -------------------------------------------------------- 
## wqw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.830   3.070   3.160   3.183   3.280   3.720 
## -------------------------------------------------------- 
## wqw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.790   3.080   3.160   3.169   3.240   3.790 
## -------------------------------------------------------- 
## wqw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.080   3.180   3.189   3.280   3.810 
## -------------------------------------------------------- 
## wqw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.840   3.100   3.200   3.214   3.320   3.820 
## -------------------------------------------------------- 
## wqw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.940   3.120   3.230   3.219   3.330   3.590 
## -------------------------------------------------------- 
## wqw$quality: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.200   3.280   3.280   3.308   3.370   3.410
The relationship between quality and pH shows a slight trend. The mean and median pH values (as seen in the box plot and summary table) decrease slightly across the lower quality ratings from 3 to 5 and then gradually increase, suggesting a slight contribution of pH towards the quality of the white wine samples.

Relation between Quality and Citric Acid for Red wine

Citric Acid Histogram for Red wine samples

Scatter Plot for Red wine (Quality vs. Citric Acid)

Box Plot for Red wine (Quality vs. Citric Acid)

Table for Quality vs. Citric Acid of Red wine

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200
The scatter plot and the mean and median values on the box plot show citric acid increasing gradually with quality; in other words, higher-quality red wine samples tend to contain more citric acid.
Let us explore correlations between features other than alcohol and quality.

Correlation between Density and Residual Sugar

Residual Sugar Histograms revisited

Scatter Plot for White wine (Density vs. Residual Sugar)

Correlation coefficient for White wine (Density vs. Residual Sugar)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$density and wqw$residual.sugar
## t = 107.8749, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665
The scatter plot and the correlation coefficient clearly show a strong relationship between density and residual sugar for white wine samples. A correlation coefficient of r = 0.84 (r^2 of about 0.70) suggests that the relationship between density and residual sugar can be approximated by a linear equation.
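
The "cor" value that cor.test reports is the correlation coefficient r. The coefficient of determination r^2, i.e. the share of variance explained by a straight-line fit, is its square. The sketch below squares the reported value and then checks the identity on invented data; the vectors `x` and `y` are simulated and only stand in for residual sugar and density.

```r
# Correlation coefficient reported by cor.test for white wine.
r <- 0.8389665
r_squared <- r^2   # share of variance explained by a simple linear fit
print(round(r_squared, 2))

# Check of the identity R^2 = r^2 on synthetic data (values are invented).
set.seed(7)
x <- runif(100, 1, 20)
y <- 0.99 + 0.0004 * x + rnorm(100, sd = 0.001)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```

The identity holds for simple linear regression with one predictor and an intercept, which is the case considered here.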

Scatter Plot for Red wine (Density vs. Residual Sugar)

Correlation coefficient for Red wine (Density vs. Residual Sugar)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$density and wqr$residual.sugar
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3116908 0.3973835
## sample estimates:
##       cor 
## 0.3552834
The correlation between density and residual sugar for red wine is low compared to the white wine samples, but the scatter plot and r = 0.36 do show some association. A linear equation can still describe this relationship, but it would account for far less of the variation (r^2 of about 0.13).

Relationship between Density and Fixed Acidity

Fixed Acidity Histograms revisited

Scatter Plot for White wine (Density vs. Fixed Acidity)

Correlation coefficient for White wine (Density vs. Fixed Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$density and wqw$fixed.acidity
## t = 19.2558, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2391013 0.2911738
## sample estimates:
##      cor 
## 0.265331
The scatter plot shows some correlation between density and fixed acidity for white wine samples. This is also reflected in the calculated correlation coefficient of r = 0.27.

Scatter Plot for Red wine (Density vs. Fixed Acidity)

Correlation coefficient for Red wine (Density vs. Fixed Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$density and wqr$fixed.acidity
## t = 35.8771, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473
The scatter plot for the red wine data has a wide distribution of points but shows a fairly strong correlation between density and fixed acidity. This is also evident from the correlation coefficient calculated above (r = 0.67).

Correlation between Sugar to Acid ratio and Total Sulphur Dioxide for White wine

Distribution of Sugar to Acid Ratio and Total Sulphur Dioxide for White wine samples revisited

Scatter Plot for White wine (Sugar to Acid Ratio vs. Total Sulphur Dioxide)

Correlation coefficient for White wine (Sugar to Acid Ratio vs. Total Sulphur Dioxide)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$sugar.acid.ratio and wqw$total.sulfur.dioxide
## t = 28.6273, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3544143 0.4024012
## sample estimates:
##       cor 
## 0.3786622
I created the ‘Sugar to Acid ratio’ variable as an input to determine taste for wine samples. Plotting it against total SO2 revealed a moderate correlation between these variables. Running cor.test gave r = 0.38.

Relation between Density and Combined Acidity for Red wine

Histogram of Combined Acidity for Red wine samples revisited

Scatter Plot for Red wine (Density vs. Combined Acidity)

Correlation coefficient for Red wine (Density vs. Combined Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$density and wqr$combined.acidity
## t = 36.6195, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6480371 0.7013884
## sample estimates:
##       cor 
## 0.6755962
The variable ‘Combined Acidity’ was also created as an input to identify taste buckets for wine samples. Comparing it to density revealed a fairly high correlation of 0.68.

Relationship between Residual Sugar and Combined Acidity

Scatter Plot for White wine (Residual Sugar vs. Combined Acidity)

Correlation coefficient for White wine (Residual Sugar vs. Combined Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$residual.sugar and wqw$combined.acidity
## t = 7.3692, df = 4896, p-value = 2.005e-13
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07695678 0.13235571
## sample estimates:
##       cor 
## 0.1047375
The scatter plot for residual sugar and combined acidity shows that most of the wine samples with a combined acidity of 6-9 have residual sugar content concentrated around 5-10 g/dm3. The correlation coefficient of 0.10 shows a very slight association between the two variables.

Scatter Plot for Red wine (Residual Sugar vs. Combined Acidity)

Correlation coefficient for Red wine (Residual Sugar vs. Combined Acidity)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$residual.sugar and wqr$combined.acidity
## t = 5.0138, df = 1597, p-value = 5.93e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07592997 0.17245647
## sample estimates:
##       cor 
## 0.1244877
Red wine samples are distributed over a wide range of combined acidity. Most points lie below 4 g/dm3 of residual sugar. Despite the greater scatter compared to white wine, the correlation coefficient of r = 0.12 is slightly higher than that for white wine.

Relationship between Sulphates and Chlorides

Scatter Plot for White wine (Sulphates vs. Chlorides)

Correlation coefficient for White wine (Sulphates vs. Chlorides)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$sulphates and wqw$chlorides
## t = 1.1731, df = 4896, p-value = 0.2408
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01124885  0.04474833
## sample estimates:
##        cor 
## 0.01676288
Sulphates for white wine samples are plotted against chlorides. Almost 90% of the points have chloride content less than 0.1 g/dm3. The plot shows a poor relationship between these two variables, which is reflected in a very low r of 0.02 that is not statistically significant (p = 0.24).
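
The test statistic, confidence interval and p-value in the output above follow from r and the sample size alone. The sketch below recomputes them from the reported values using the t statistic for a Pearson correlation and the Fisher z-transformation for the interval; the variable names are my own, and small differences in the last digits are possible because r is taken from the rounded printout.

```r
# Values taken from the cor.test output for white wine sulphates vs. chlorides.
r  <- 0.01676288
df <- 4896
n  <- df + 2

# t statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95 percent confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)
print(round(c(t = t_stat, lower = ci[1], upper = ci[2], p = p_value), 4))
```

Because the interval contains 0, the data do not rule out the absence of any linear relationship between sulphates and chlorides in white wine.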

Scatter Plot for Red wine (Sulphates vs. Chlorides)

Correlation coefficient for Red wine (Sulphates vs. Chlorides)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$sulphates and wqr$chlorides
## t = 15.9785, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3282127 0.4127694
## sample estimates:
##       cor 
## 0.3712605
Red wine samples show a much stronger relationship than white wine samples, with r = 0.37. Most of the samples are concentrated below 0.2 g/dm3 for chlorides and between 0.5-1.0 g/dm3 for sulphates.

Correlation between Total and Free SO2

As Free SO2 is one of the components of Total SO2, the correlation between them is expected to be high. This is shown in the correlation results below.

Correlation coefficient for White wine (Total SO2 vs. Free SO2)

## 
##  Pearson's product-moment correlation
## 
## data:  wqw$total.sulfur.dioxide and wqw$free.sulfur.dioxide
## t = 54.6447, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5977994 0.6326026
## sample estimates:
##      cor 
## 0.615501

Correlation coefficient for Red wine (Total SO2 vs. Free SO2)

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$total.sulfur.dioxide and wqr$free.sulfur.dioxide
## t = 35.8402, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Relationships observed in this section are as follows:
Quality with Alcohol: Scatter plots and box plots were created to examine this relationship. Quality and alcohol, the main features of both data sets, showed a moderate positive correlation of 0.44 for white wine and 0.48 for red wine. Scatter plots showed the bigger picture, i.e. the positive correlation between quality and alcohol, while box plots added depth: across the lower quality ratings (3-5) alcohol levels drop, and above quality level 5 they increase roughly linearly.
Alcohol relationship with pH: For white wine, samples were concentrated in the pH range of 3.0-3.3 and mostly under an alcohol percentage of 10.5. Heavy scattering made it difficult to judge the extent of correlation between the two variables visually, so the correlation coefficient was calculated; it came out low but positive, showing a weak relation. For red wine, most of the points were concentrated between pH of 3.2-3.5 and alcohol percentage of 9-11. The scatter plot suggested some correlation between these variables, and the calculated coefficient of r = 0.2 confirmed a weak positive relationship.
Alcohol with respect to Taste and Taste due to pH: White wine samples that taste on the dry side tend to have more alcohol, while sweeter tasting samples have less alcohol. For red wine, samples that are dry have less alcohol than the sweeter ones. This observation is evident from the box plots in the Bivariate Plots section.
Quality vs. Sulphates: Box and scatter plots were created to determine the relation between quality and sulphates. White wine samples showed a poor correlation compared to red wine samples. For white wine, sulphate content increased slightly from quality rating 3 to 7 and dropped from then on. Red wine samples showed a better correlation, as an increase in average sulphate content per sample went along with an increase in quality.
Quality with pH for White wine: The relation of quality and pH for white wine samples followed a parabolic shape, i.e. when mean and median values of pH are plotted for each quality rating bucket, the pH value decreases down to the quality rating of 5 (the minimum) and then starts to gradually increase.
Quality with Citric Acid for Red wine: The scatter plot showed close to a linear relationship between quality and citric acid for samples of red wine. Box plots further made it clear that quality generally increases as the average citric acid content per sample of red wine increases.
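The two kinds of summaries used above, a Pearson correlation and a per-rating mean/median, can be sketched in base R as follows. This is a minimal illustration on made-up data; the data frame toy and its values are assumptions and not the wine data itself.

```r
# Toy data standing in for a wine data frame (values are made up for illustration).
set.seed(1)
toy <- data.frame(
  quality = sample(3:9, 200, replace = TRUE),
  alcohol = runif(200, 8, 14),
  pH      = rnorm(200, mean = 3.2, sd = 0.15)
)

# Pearson correlation between quality and alcohol, with a significance test.
ct <- cor.test(toy$quality, toy$alcohol, method = "pearson")
r  <- unname(ct$estimate)  # the correlation coefficient r reported as "cor"
r2 <- r^2                  # squaring r gives the share of variance explained by a line

# Mean and median pH per quality rating bucket.
ph_by_quality <- aggregate(
  pH ~ quality, data = toy,
  FUN = function(x) c(mean = mean(x), median = median(x))
)
print(ph_by_quality)
```

Plotting the mean column against quality is one way to check for a curved (for example parabolic) pattern like the one described for white wine pH.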

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There were interesting correlations observed between other features in the two data sets. Some of them are as follows:
Correlation between Density and Residual Sugar: Scatter diagrams were plotted and correlation coefficients were calculated to view the relation between density and residual sugar for both wine types. For white wine samples, the correlation was very strong (r = 0.84), while it was lower for red wine (r = 0.36).
Relationship between Density and Fixed Acidity: Again, density and fixed acidity were plotted against each other using scatter plots. Most of the points were in the density range of 0.99 - 1.00 g/cm^3 for both wine types. Along with the plots, correlation coefficients were computed between density and fixed acidity, which showed some correlation for white wine samples (r = 0.27) and a stronger correlation for the red wine samples (r = 0.67).
Sugar to Acid ratio vs. Total Sulphur Dioxide for White wine: Sugar to acid ratio, a feature derived from residual sugar, correlated moderately with total SO2 for white wine samples. A scatter plot followed by cor.test gave r = 0.37. This is consistent with r = 0.4 between residual sugar and total SO2.
Density with Combined Acidity for Red wine: Combined acidity, a summation of all acidic components in the data sets, showed a strong correlation with density for red wine samples (r = 0.68).
Relationship between Sulphates and Chlorides: These are not the main variables but showed a positive correlation, especially for red wine samples. For white wine, close to 90% of the points had chloride content less than 0.1 g/dm3, and the relationship between the two variables was poor, as shown by a very low r of 0.02. Red wine samples, on the other hand, show a much better relationship with a comparatively high value of r = 0.37.
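As a rough sketch of how a summed acidity feature and its correlation with density could be computed: the exact definition used for combined.acidity is not shown in this section, so the sum of fixed.acidity, volatile.acidity and citric.acid below is an assumption, and the four rows are illustrative values only.

```r
# Illustrative rows only; not a substitute for the real data set.
toy <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.2),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.23),
  citric.acid      = c(0.36, 0.34, 0.40, 0.32),
  density          = c(1.0010, 0.9940, 0.9951, 0.9956)
)

# Assumed definition: combined acidity as the sum of the three acid columns.
toy$combined.acidity <- with(toy, fixed.acidity + volatile.acidity + citric.acid)

# Pearson correlation between the derived feature and density.
r <- cor(toy$combined.acidity, toy$density)
print(toy$combined.acidity)
print(r)
```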

What was the strongest relationship you found?

The strongest relationship observed was between density and residual sugar for white wine samples (r = 0.84).
Other notable correlations are as follows:
- Citric Acid with Fixed Acidity for red wine (r = 0.67)
- Density vs. Fixed Acidity for red wine (r = 0.67)
- Total SO2 and Free SO2 for both data sets (r = 0.67 [Red] and r = 0.62 [White])
- Density vs. Total SO2 for white wine (r = 0.53)
- Quality with Alcohol for both data sets (r = 0.48 [Red] and r = 0.44 [White])
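One way to find the strongest pairwise relationship programmatically is to compute the full correlation matrix and pick the largest absolute off-diagonal entry. The sketch below does this on made-up data in which density is constructed to depend on residual sugar; the data frame and its coefficients are assumptions for illustration.

```r
set.seed(2)
n <- 300
sugar <- runif(n, 1, 20)
toy <- data.frame(
  residual.sugar = sugar,
  density        = 0.99 + 0.0004 * sugar + rnorm(n, sd = 0.001),
  alcohol        = runif(n, 8, 14),
  pH             = rnorm(n, 3.2, 0.15)
)

cm <- cor(toy)                        # pairwise Pearson coefficients
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal

# Position of the largest absolute correlation among the remaining pairs.
idx <- which(abs(cm) == max(abs(cm), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])
print(strongest)
```

Using the absolute value means a strong negative correlation would be picked up as well as a strong positive one.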

Multivariate Plots Section

This section covers exploring relationships between multiple variables of the two wine data sets. Let us start with the main features first.

Alcohol with Quality

Density Plot of Alcohol with Quality for White wine

The density plot shows clearly that samples with a lower percentage of alcohol tend to receive lower ratings from the wine experts, while alcohol content above 11% (with an exception in quality rating 9) generally leads to higher ratings such as 6, 7, 8 or 9.

Density Plot of Alcohol with Quality for Red wine

The density plot for red wine shows a similar trend: the lower the percentage of alcohol, the lower the rating given to the sample, and vice versa.
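What a density plot per quality rating conveys can also be checked numerically by locating the peak of each kernel density curve. The following is a minimal base R sketch on made-up data in which alcohol is built to rise with quality; the data and the choice of density() defaults are assumptions.

```r
set.seed(6)
toy <- data.frame(quality = rep(c(5, 6, 7), each = 150))
# Toy alcohol values whose centre shifts upward with the quality rating.
toy$alcohol <- rnorm(nrow(toy), mean = 9.5 + 0.7 * (toy$quality - 5), sd = 0.6)

# For each quality bucket, estimate the density and take the alcohol
# value at which the curve is highest.
peak_alcohol <- sapply(split(toy$alcohol, toy$quality), function(x) {
  d <- density(x)
  d$x[which.max(d$y)]
})
print(round(peak_alcohol, 2))
```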
Let us introduce another variable and look for more interesting patterns.

Alcohol and Quality with Taste for White wine

Scatter Plot for White wine (Quality vs. Alcohol) with Taste

Samples for the white wine data set do show a trend, but as most of the points in the higher and lower quality rating buckets are distributed non-uniformly, it is difficult to analyze. Let us take a closer look in the plot below.
NOTE: I have only focused on the mid range buckets of quality as they contain the majority of the points, which I think makes it easier to find a trend here.

Scatter Plot for White wine (Quality vs. Alcohol) with Taste [zoomed]

## Warning in loop_apply(n, do.ply): Removed 367 rows containing missing
## values (geom_point).

Zooming in clarifies the situation. It is clearly visible that, with increasing alcohol content, the white wine samples start to taste in the ‘Dry’ to ‘Medium Dry’ range and the quality rating ultimately increases. I wonder what role taste due to pH would have here.
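The "Removed ... rows containing missing values" warning above is what ggplot2 emits when points fall outside the limits of a scale. Assuming the zoom was done with scale limits (the plotting code is not shown here, so this is an assumption), the sketch below contrasts that approach with coord_cartesian, which only changes the visible window. The toy data frame is made up.

```r
set.seed(3)
toy <- data.frame(
  quality = sample(3:9, 500, replace = TRUE),
  alcohol = runif(500, 8, 14)
)

# Rows outside the zoom window of quality 5-7: this is the number of
# points that a scale limit on quality would drop.
outside <- sum(toy$quality < 5 | toy$quality > 7)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  base_plot <- ggplot(toy, aes(x = quality, y = alcohol)) + geom_point()

  # Option 1: scale limits. Points outside 5-7 are turned into NA and
  # removed, with a warning, when the plot is drawn.
  p_limits <- base_plot + xlim(5, 7)

  # Option 2: coord_cartesian zooms the view only; no rows are removed.
  p_zoom <- base_plot + coord_cartesian(xlim = c(4.5, 7.5))
}
```

With scale limits, any statistics layered on the plot (smoothers, box plots) are computed only from the remaining points, whereas coord_cartesian keeps all data in the computation.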

Scatter Plot for White wine (Quality vs. Alcohol) with Taste due to pH [zoomed]

## Warning in loop_apply(n, do.ply): Removed 369 rows containing missing
## values (geom_point).

Bringing pH into the equation shows similar results, as expected. Some of the taste bins are scattered, but the general trend is that pH shifts the taste up a notch to the next category. Increasing alcohol content makes a white wine sample drier, resulting in a better quality rating.

Alcohol and Quality with Taste for Red wine

Scatter Plot for Red wine (Quality vs. Alcohol) with Taste

As the red wine data set has fewer samples than the white wine data set, plotting alcohol vs. quality with taste does not tell us much. Below is a zoomed in plot using taste due to pH.

Scatter Plot for Red wine (Quality vs. Alcohol) with Taste due to pH [zoomed]

## Warning in loop_apply(n, do.ply): Removed 85 rows containing missing
## values (geom_point).

Drilling down into the quality buckets (5-7) containing most of the red wine samples, we see that the majority of the points do not follow a particular trend, unlike the white wine samples.

pH with Quality

Density Plot of pH with Quality for White wine

Plotting pH shows that (apart from quality rating 3, where pH covers a wide range) most of the mid range quality ratings are given to samples with pH from 3.0-3.2. For quality ratings 8 and 9 there is a slight increase in the pH range, which suggests that pH plays a role in the higher rated samples of white wine.

Density Plot of pH with Quality for Red wine

In contrast to white wine, higher ranges of pH per sample for red wine tend to receive poor ratings (the shift in the lower rating curves towards higher pH) and vice versa.

Relation of Alcohol and pH with Quality for White wine

Scatter Plot for White wine (Alcohol vs. pH) with Quality

Again, to analyze the trend here I have only selected the quality rating buckets with the most samples. For samples with a lower percentage of alcohol and slightly higher pH, the quality tends to suffer. On the other hand, samples with higher alcohol content (greater than 11%) and a stable pH (e.g. 3.2) receive good quality ratings from the experts.

Relation of Alcohol and pH with Quality for Red wine

Scatter Plot for Red wine (Alcohol vs. pH) with Quality

Applying the same quality bucket configuration to red wine samples reveals similar results. With less alcohol content and pH on the higher side, experts tend to give lower ratings. In contrast, more alcohol content with pH kept under 3.3 leads the experts to rank the samples high.

Sulphates and Quality

Density Plot of Sulphates with Quality for White wine

For white wine, sulphate content does not seem to affect the quality rating much. The exception is the curve for rating 9, with a spike close to 0.6 g/dm3; apart from that, most of the curves for the quality rating buckets overlap with each other.

Density Plot of Sulphates with Quality for Red wine

Here things look different from the white wine samples. Red wine samples that receive high quality ratings do have an excess of sulphates, as can be seen from the curves for ratings 7 and 8. Their peaks stand apart from the rest at a sulphate content of 0.7 g/dm3 or more.

Sulphates and Alcohol with Quality for White wine

Scatter Plot for White wine (Sulphates vs. Alcohol) with Quality

The plot for alcohol and sulphates shows samples binned by quality rating. It is difficult to find a trend with a number of outliers around the edges. Let us look at the zoomed version in the plot below.

Scatter Plot for White wine (Sulphates vs. Alcohol) with Quality [zoomed]

## Warning in loop_apply(n, do.ply): Removed 700 rows containing missing
## values (geom_point).

I filtered the quality buckets to view only quality ratings 5-7 (the majority of the points). The points plotted are mostly scattered and do not show a clear trend as to how sulphates affect quality (in accordance with the density plots in the section above).

Sulphates and Alcohol with Quality for Red wine

Scatter Plot for Red wine (Sulphates vs. Alcohol) with Quality

The scatter plot for red wine samples is pretty straightforward and shows some degree of linearity between alcohol and sulphate content. Higher alcohol percentage and more sulphate content (around 0.8-1.0 g/dm3) lead to a higher quality rating (greater than 6). The lower quality ratings (3-5) mostly occur in the region with low sulphate content and low alcohol percentage.

Residual Sugar with Quality

Density Plot of Residual Sugar with Quality for White wine

In my opinion, residual sugar plays an important part when the experts decide the quality rank for a particular wine sample. From the plot it looks like most of the high rated samples have less than 5 g/dm3 of residual sugar. For the remaining samples, quality ratings fluctuate across the residual sugar range of 10-20 g/dm3.

Density Plot of Residual Sugar with Quality for Red wine

Most of the samples, independent of the quality rating they receive, have a residual sugar level below 4 g/dm3. For samples with residual sugar above 4 g/dm3, ratings are mixed, i.e. they do not follow a clear pattern.

Residual Sugar and Alcohol with Quality for White wine

Scatter Plot for White wine (Residual Sugar vs. Alcohol) with Quality

Most of the points in the scatter plot are concentrated below 5 g/dm3 of residual sugar, as seen in the density plot. The scatter plot also shows that samples with more sugar content do receive higher quality ratings.

Residual Sugar and Alcohol with Quality for Red wine

Scatter Plot for Red wine (Residual Sugar vs. Alcohol) with Quality [zoomed]

## Warning in loop_apply(n, do.ply): Removed 131 rows containing missing
## values (geom_point).

The zoomed version of the scatter plot for red wine does show a trend. Samples that get higher quality ratings have residual sugar under 2.5 g/dm3 and alcohol content of more than 11%. Samples with mid range quality ratings (5-6) have more residual sugar and an alcohol percentage below 10% on average.

Acidic Content and Quality

Total Acidic Content with Quality (White wine on left and Red wine on right)

Citric Acid: For white wine, almost all of the samples have citric acid content in the range 0-0.6 g/dm3, and quality ratings vary within this bracket. For red wine, all samples have citric acid below 1 g/dm3; samples with high quality ratings have citric acid content above 0.3 g/dm3, while those with lower ratings have less than 0.25 g/dm3, as seen in the density plot.

Acidic Content and Alcohol with Quality

Scatter plot for Total Acidic Content and Alcohol with Quality (White wine on left and Red wine on right)

White wine: The general trend is that an increase in acidic content does not play an important part in improving quality; however, samples with higher volatile acidity do get lower ratings.
Red wine: There is a definite trend here: samples with an increased alcohol percentage and an increase in any form of acidic content tend to get higher quality ratings, and vice versa (especially for fixed acidity and citric acid).

Chlorides and Quality

Density Plot of Chlorides with Quality for White wine

Salts in drinks should be balanced. The plot shows that white wine samples with chlorides above 0.05 g/dm3 get mid range to lower range quality ratings from the experts, while balanced samples get better ratings (7-9). One or more samples clearly receive the lowest rating of 3 with a very high salt content (around 0.24 g/dm3).

Density Plot of Chlorides with Quality for Red wine

Most of the samples have chloride levels under 0.1 g/dm3, and quality is almost independent of chlorides as the rating curves overlap with each other. Again, the curve for rating 3 shows variations, indicating that there were samples that were excessively salty.

Chlorides and Alcohol with Quality

Scatter Plot for (Chlorides vs. Alcohol) with Quality [Facet Wrap]

I tried a different approach here and used the combination of the two data sets. Chloride amounts at controlled levels (i.e. less than 0.15 g/dm3) do not appear to affect the quality. It is only when saltiness increases that samples start to get lower ratings, for both wine types.
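Combining the two data sets with a type column, so that a single plot can be faceted by wine type, can be sketched as below. The rows are illustrative values and the names white_toy, red_toy and combined_toy are hypothetical; only the idea of a type variable comes from this report.

```r
# Illustrative rows only.
white_toy <- data.frame(chlorides = c(0.045, 0.049, 0.240),
                        alcohol   = c(8.8, 9.5, 10.1),
                        quality   = c(6, 6, 3))
red_toy   <- data.frame(chlorides = c(0.076, 0.098),
                        alcohol   = c(9.4, 9.8),
                        quality   = c(5, 5))

# Tag each data set with its wine type, then stack them row-wise.
white_toy$type <- "White"
red_toy$type   <- "Red"
combined_toy <- rbind(white_toy, red_toy)
combined_toy$type <- factor(combined_toy$type)

# Share of samples under the 0.15 g/dm3 chloride level, per wine type.
under <- tapply(combined_toy$chlorides < 0.15, combined_toy$type, mean)
print(under)
```

With ggplot2, a combined data frame like this can be split into one panel per wine type by adding facet_wrap(~ type) to the plot.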

Sulphur Dioxide and Quality

SO2 with Quality (White wine on left and Red wine on right)

Most of the white wine samples that get good quality ratings contain total SO2 in the range of 0-150 mg/dm3 and free SO2 in the range of 25-50 mg/dm3. There are samples that have more SO2 and ultimately get poor ratings from the experts. On the other hand, the majority of the red wine samples that get higher ratings have a much lower amount of SO2. Looking at the ratings, it seems that the ideal range of SO2 for red wine is less than 50 mg/dm3 for total and less than 20 mg/dm3 for free SO2.

Sulphur Dioxide and Alcohol with Quality

Scatter plot for Total Sulphur Dioxide and Alcohol with Quality (White wine on left and Red wine on right)

White wine: There does not seem to be a strong relation between the variables, but to some extent samples with more alcohol and controlled amounts of total and free SO2 tend to get higher ratings.
Red wine: Here there appears to be a good relationship between the 3 variables. Especially for free SO2, it can clearly be observed that an increase in both alcohol and free SO2 leads the experts to give higher ratings.

Density and Quality

Density Plot of Density with Quality for White wine

Samples with density close to 0.99 g/cm3 tend to get high quality ratings. Ratings start to decrease as density increases and approaches 1.00 g/cm3.

Density Plot of Density with Quality for Red wine

Again, for red wine most samples with density close to or below 0.995 g/cm3 get a higher rating. As we go towards higher density, the experts seem to find the samples losing the wow factor and give lower ratings.

Density and Alcohol with Quality

Scatter Plot for (Density vs. Alcohol) with Quality [Facet Wrap]

Density and Residual Sugar with Quality for White wine

Scatter Plot for (Density vs. Residual Sugar) with Quality

Density and residual sugar for white wine have the highest correlation, and it can be seen here in the plot. Adding a line for reference reveals a divide. As residual sugar increases, samples whose density stays below the reference line get high ratings (7-9), while samples whose density rises above the line see their quality suffer.
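The idea of splitting samples by a reference line can be made concrete in code. The line drawn in the plot is not specified here, so the sketch below swaps in a least-squares fit of density on residual sugar as the reference line (an assumption) and compares mean quality on either side. The toy data are constructed so that denser-than-expected samples score lower.

```r
set.seed(4)
n <- 400
sugar    <- runif(n, 1, 20)
expected <- 0.99 + 0.0004 * sugar             # toy relationship, made up
density  <- expected + rnorm(n, sd = 0.001)
# Toy quality: lower when density sits above the expected value.
quality  <- round(6 - 500 * (density - expected) + rnorm(n, sd = 0.5))
toy <- data.frame(residual.sugar = sugar, density = density, quality = quality)

# Reference line: ordinary least-squares fit of density on residual sugar.
ref <- lm(density ~ residual.sugar, data = toy)

# A positive residual means the sample lies above the line.
toy$side <- ifelse(resid(ref) > 0, "above", "below")
mean_quality <- tapply(toy$quality, toy$side, mean)
print(round(mean_quality, 2))
```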

Density and Fixed Acidity with Quality for Red wine

Scatter Plot for (Density vs. Fixed Acidity) with Quality

Both variables, which are well correlated, show a definite trend for quality. On average the samples with higher fixed acidity (more than 8 g/dm3) and density less than or equal to 0.996 g/cm3 receive higher quality ratings. As the fixed acidity goes below 8 g/dm3 and density approaches 1.00 g/cm3, samples start to receive poor to mid range ratings.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Relationships explored in the multivariate section are as follows:
Alcohol and Quality with Taste: Density and scatter plots were used to explore the relationship between these features. For both wine types, density plots showed that with an increase in the percentage of alcohol there was an increase in the quality rating for the samples. Taste due to pH was the additional variable added to the equation, and it was observed that white wine samples, with increasing alcohol, started to taste in the ‘Dry’ to ‘Medium Dry’ range and the quality rating ultimately increased. For red wine, on the other hand, introducing taste due to pH did not reveal much of a trend or relation between these variables.
NOTE: Here samples for white wine in particular strengthen each other in terms of the variables alcohol, taste and quality.
Relation of Alcohol and pH with Quality: In general, according to the density plot for white wine, mid to low range quality ratings did not really depend upon pH, while with increasing pH there was an improvement in the ratings, especially for ratings 8 and 9. Red wine samples told a different story: the higher the pH value of a sample, the lower the rating it got from the experts, and vice versa. Scatter plots showed similar results for both wine types. Samples with less alcohol and pH on the higher side tended to receive poor ratings, while samples with alcohol greater than 11% and a stable pH of around 3.2-3.3 received high ratings.
NOTE: I think pH, alcohol and quality strengthened each other for white wine samples.
Alcohol and Quality and Sulphates: For white wine samples, on average sulphate content did not affect the quality rating as such; red wine samples, however, did show a trend. Samples with sulphate content around 0.7 g/dm3 received higher ratings than the ones with sulphates around 0.5 g/dm3. For scatter plots of white wine with alcohol, quality and sulphates it was difficult to pinpoint a particular trend. For red wine, there was a definite positive trend, i.e. the higher the alcohol and sulphate content per sample, the higher the ratings from the wine experts.
NOTE: Sulphate and alcohol played a part in improving the quality rating, hence strengthening each other in the case of red wine samples.
Relationship of Residual Sugar and Alcohol with Quality: The density plots for residual sugar with curves colored by quality rating did not really show a clear trend. Most of the quality curves are stacked in the same ranges of residual sugar for both wine types. The scatter plot for white wine did show a relation between the variables, i.e. samples with more residual sugar did receive higher ratings. The zoomed version of the scatter plot for red wine samples showed that for higher ratings the alcohol content was observed to be higher and the residual sugar content lower, and vice versa.
NOTE: In my opinion, the variables residual sugar, quality and alcohol do strengthen each other to some extent for samples of red wine.

Were there any interesting or surprising interactions between features?

Interactions of the most highly correlated variables are as follows:
Density and Residual Sugar with Quality for White wine: Density and residual sugar for white wine showed the highest correlation with each other. I thought of investigating the trend of these two variables with our main variable of interest. I put all three on a scatter plot colored by quality and drew an imaginary line to roughly split the ratings. The line added a clear divide: with increasing residual sugar, samples whose density stayed within the threshold set by the line were observed to improve in quality, while samples with increasing residual sugar and density above the line threshold received lower quality ratings from the experts.
Density and Fixed Acidity with Quality for Red wine: For red wine, fixed acidity and density showed the highest correlation. I used these two with quality to see if there was a trend. An imaginary line for reference was again drawn here, and a trend was observed. For increasing fixed acidity per sample, density above the line threshold got good reviews, i.e. high quality ratings, and density values less than the threshold received average to poor reviews.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
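The model table that follows adds one predictor at a time to a linear model of quality. A minimal sketch of that stepwise pattern is given here, using update() to extend each model and comparing adjusted R-squared. The toy data frame and the coefficients used to generate it are assumptions made for illustration, and only base R is used (the table itself appears to come from a separate formatting package).

```r
set.seed(5)
n <- 500
toy <- data.frame(
  alcohol   = runif(n, 8, 14),
  pH        = rnorm(n, 3.2, 0.15),
  sulphates = runif(n, 0.3, 1.0)
)
# Toy response: depends on alcohol and sulphates, not on pH.
toy$quality <- 2.4 + 0.3 * toy$alcohol + 0.5 * toy$sulphates + rnorm(n, sd = 0.8)

# Build nested models by adding one term at a time.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + pH)
m3 <- update(m2, . ~ . + sulphates)

# Adjusted R-squared penalises terms that add little explanatory power.
adj_r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                 function(m) summary(m)$adj.r.squared)
print(round(adj_r2, 3))

# F-tests for whether each added term improves on the previous model.
print(anova(m1, m2, m3))
```

Plain R-squared can never decrease when a term is added to a nested least-squares model, which is why adjusted R-squared and the F-tests are the more informative comparison.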

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = combined_wq)
## m2: lm(formula = quality ~ alcohol + pH, data = combined_wq)
## m3: lm(formula = quality ~ alcohol + pH + sulphates, data = combined_wq)
## m4: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5), 
##     data = combined_wq)
## m5: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide, data = combined_wq)
## m6: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)), data = combined_wq)
## m7: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity, 
##     data = combined_wq)
## m8: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid, data = combined_wq)
## m9: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)), data = combined_wq)
## m10: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)), 
##     data = combined_wq)
## m11: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar, data = combined_wq)
## m12: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)), data = combined_wq)
## m13: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + sugar.acid.ratio, 
##     data = combined_wq)
## m14: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + sugar.acid.ratio + 
##     taste, data = combined_wq)
## m15: lm(formula = quality ~ alcohol + pH + sulphates + I(sulphates^5) + 
##     free.sulfur.dioxide + I(free.sulfur.dioxide^(1/10)) + volatile.acidity + 
##     citric.acid + I(log10(density)) + I(log10(total.sulfur.dioxide)) + 
##     residual.sugar + I(log10(combined.acidity)) + sugar.acid.ratio + 
##     taste + fixed.acidity, data = combined_wq)
## 
## ===================================================================================================================================================================================================================
##                                     m1          m2          m3          m4          m5          m6          m7          m8          m9          m10         m11         m12         m13         m14         m15    
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)                       2.405***    2.982***    2.987***    3.023***    2.277***   -1.983***   -1.028**    -0.927*     -1.316**    -2.164***   -2.466***    -4.232***   -4.160***   -4.748***   -3.973***
##                                  (0.086)     (0.204)     (0.204)     (0.204)     (0.209)     (0.396)     (0.388)     (0.401)     (0.404)     (0.407)     (0.406)      (0.574)     (0.594)     (0.605)     (0.747)  
## alcohol                           0.325***    0.328***    0.329***    0.329***    0.348***    0.342***    0.323***    0.323***    0.381***    0.352***    0.303***     0.261***    0.261***    0.246***    0.238***
##                                  (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.008)     (0.011)     (0.012)     (0.013)      (0.016)     (0.016)     (0.017)     (0.018)  
## pH                                           -0.189**    -0.241***   -0.267***   -0.195**    -0.187**     0.073       0.055       0.024      -0.044       0.101        0.360***    0.359***    0.402***    0.417***
##                                              (0.061)     (0.062)     (0.063)     (0.062)     (0.061)     (0.061)     (0.064)     (0.063)     (0.063)     (0.065)      (0.088)     (0.088)     (0.090)     (0.090)  
## sulphates                                                 0.284***    0.385***    0.551***    0.681***    0.821***    0.834***    0.693***    0.547***    0.767***     0.829***    0.828***    0.844***    0.850***
##                                                          (0.066)     (0.076)     (0.076)     (0.076)     (0.074)     (0.075)     (0.077)     (0.078)     (0.082)      (0.083)     (0.083)     (0.083)     (0.083)  
## I(sulphates^5)                                                       -0.038**    -0.043**    -0.052***   -0.050***   -0.051***   -0.043**    -0.034**    -0.040**     -0.039**    -0.039**    -0.039**    -0.038** 
##                                                                      (0.014)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)     (0.013)      (0.013)     (0.013)     (0.013)     (0.013)  
## free.sulfur.dioxide                                                               0.007***   -0.009***   -0.009***   -0.008***   -0.009***   -0.013***   -0.014***    -0.014***   -0.014***   -0.015***   -0.015***
##                                                                                  (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)     (0.001)      (0.001)     (0.001)     (0.001)     (0.001)  
## I(free.sulfur.dioxide^(1/10))                                                                 3.433***    2.517***    2.499***    2.685***    4.660***    4.717***     4.700***    4.705***    4.799***    4.821***
##                                                                                              (0.272)     (0.268)     (0.269)     (0.269)     (0.319)     (0.317)      (0.317)     (0.317)     (0.316)     (0.316)  
## volatile.acidity                                                                                         -1.247***   -1.269***   -1.429***   -1.558***   -1.380***    -1.396***   -1.399***   -1.397***   -1.300***
##                                                                                                          (0.063)     (0.067)     (0.071)     (0.071)     (0.074)      (0.074)     (0.074)     (0.074)     (0.092)  
## citric.acid                                                                                                          -0.070      -0.206**    -0.131       0.013       -0.116      -0.118      -0.112      -0.045   
##                                                                                                                      (0.072)     (0.075)     (0.074)     (0.076)      (0.081)     (0.081)     (0.081)     (0.090)  
## I(log10(density))                                                                                                                78.845***   60.937***  -44.217*    -132.737*** -132.288*** -151.958*** -163.429***
##                                                                                                                                 (11.236)    (11.239)    (17.213)     (26.631)    (26.649)    (28.208)    (28.938)  
## I(log10(total.sulfur.dioxide))                                                                                                               -0.598***   -0.758***    -0.763***   -0.767***   -0.798***   -0.799***
##                                                                                                                                              (0.053)     (0.056)      (0.056)     (0.057)     (0.057)     (0.057)  
## residual.sugar                                                                                                                                            0.028***     0.043***    0.050**     0.024       0.033   
##                                                                                                                                                          (0.004)      (0.005)     (0.016)     (0.017)     (0.018)  
## I(log10(combined.acidity))                                                                                                                                             1.267***    1.196***    1.600***   -0.092   
##                                                                                                                                                                       (0.291)     (0.326)     (0.348)     (1.017)  
## sugar.acid.ratio                                                                                                                                                                  -0.057       0.322*      0.262   
##                                                                                                                                                                                   (0.118)     (0.132)     (0.136)  
## taste: .L                                                                                                                                                                                      0.179       0.183   
##                                                                                                                                                                                               (0.559)     (0.559)  
## taste: .Q                                                                                                                                                                                      0.550       0.555   
##                                                                                                                                                                                               (0.408)     (0.407)  
## taste: .C                                                                                                                                                                                      0.365       0.367*  
##                                                                                                                                                                                               (0.187)     (0.187)  
## fixed.acidity                                                                                                                                                                                              0.089   
##                                                                                                                                                                                                           (0.050)  
## -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared                            0.197       0.199       0.201       0.202       0.223       0.242       0.284       0.285       0.290       0.304       0.311       0.313       0.313       0.318       0.318 
## adj. R-squared                       0.197       0.198       0.200       0.201       0.222       0.241       0.284       0.284       0.289       0.303       0.309       0.311       0.311       0.316       0.316 
## sigma                                0.782       0.782       0.781       0.780       0.770       0.761       0.739       0.739       0.736       0.729       0.726       0.725       0.725       0.722       0.722 
## F                                 1597.641     804.749     544.025     410.377     372.584     344.582     368.568     322.613     294.371     282.976     265.644     245.759     226.845     188.468     177.624 
## p                                    0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000 
## Log-likelihood                   -7623.404   -7618.549   -7609.411   -7605.540   -7518.180   -7439.467   -7250.383   -7249.909   -7225.344   -7161.667   -7129.474   -7120.000   -7119.884   -7096.585   -7095.014 
## Deviance                          3975.734    3969.796    3958.645    3953.930    3849.017    3756.874    3544.442    3543.924    3517.226    3448.953    3414.943    3404.997    3404.876    3380.543    3378.909 
## AIC                              15252.809   15245.098   15228.821   15223.079   15050.360   14894.933   14518.766   14519.817   14472.687   14347.334   14284.948   14267.999   14269.768   14229.170   14228.028 
## BIC                              15273.146   15272.214   15262.717   15263.754   15097.814   14949.166   14579.778   14587.608   14547.257   14428.683   14373.076   14362.907   14371.454   14351.194   14356.831 
## N                                 6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497        6497     
## ===================================================================================================================================================================================================================
I built a linear model to predict the quality of wine samples. The predictors are a combination of original, derived and log-transformed variables. Although the final model reached an R-squared of only 0.32, I tried to include as many useful features as I could. Features were selected using the correlation coefficients calculated in the previous sections, together with trial and error to check which transformation (log or power) gave better results.
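The step-by-step model building summarized in the table can be sketched as follows. This is a minimal illustration on simulated data, not the model fitted in this report: the data frame `wine`, its values, and the three predictors chosen here are assumptions made only to show how nested `lm()` fits with log-transformed terms can be compared.

```r
# Simulated stand-in for the combined wine data. Column names mirror the
# report, but the values are random and carry no real-world meaning.
set.seed(42)
n <- 500
wine <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0),
  sulphates        = runif(n, 0.3, 1.2)
)
wine$quality <- 2 + 0.3 * wine$alcohol - 1.2 * wine$volatile.acidity +
  0.8 * log10(wine$sulphates) + rnorm(n, sd = 0.7)

# Add one term at a time, keeping every earlier term in the model.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)
m3 <- update(m2, . ~ . + I(log10(sulphates)))

# Collect the fit statistics side by side.
models <- list(m1, m2, m3)
fits <- data.frame(
  model  = c("m1", "m2", "m3"),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj.r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC)
)
print(fits)
```

R-squared can never decrease when a term is added to a nested model, so AIC and BIC are the more useful columns for judging whether a term earns its place. The table above shows this: between the seventh and eighth models R-squared rises from 0.284 to 0.285, while both AIC and BIC get slightly worse.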

Final Plots and Summary

Plot One

Description One

The first plot is taken from the univariate section. It shows frequency polygons of the alcohol distribution for the White and Red wine samples. For most samples of both wine types, the alcohol percentage lies between 9% and 13%. Apart from the difference in the number of samples, both wine types appear to have a similar alcohol distribution with multiple peaks: at 9.5, 10.1, 10.5 and 11.1% for white wine and at 9.5, 9.8, 10.5 and 10.9% for red wine. The shapes of the two polygons are also similar, and one looks like a geometric enlargement of the other by a factor of approximately 2.

Plot Two

Description Two

Plot two is taken from the bivariate section of this report. It shows box plots of Alcohol vs. Quality for the Red wine samples. A stat summary function is used to plot the mean alcohol value for each quality rating bucket, marked by a red cross inside each box. For the lower quality ratings (3-5), the mean and median alcohol levels fluctuate. From there on, the percentage of alcohol shows an increasing trend as we move through the higher quality buckets.
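The numbers behind the boxes and the red crosses can be computed directly. The sketch below uses a simulated data frame `red` (an assumed stand-in for the red wine data, with made-up values) and base R's `aggregate()` to get the mean and median alcohol level for each quality bucket.

```r
# Simulated stand-in for the red wine data: values are random and only the
# column names follow the report.
set.seed(1)
red <- data.frame(quality = rep(3:8, times = c(10, 40, 100, 100, 40, 10)))
red$alcohol <- 9.5 + 0.4 * pmax(red$quality - 5, 0) +
  rnorm(nrow(red), sd = 0.8)

# Mean and median alcohol per quality bucket.
by.quality <- aggregate(
  alcohol ~ quality, data = red,
  FUN = function(x) c(mean = mean(x), median = median(x))
)
# aggregate() stores the two statistics in one matrix column; flatten it
# into the ordinary columns alcohol.mean and alcohol.median.
by.quality <- do.call(data.frame, by.quality)
print(by.quality)
```

Comparing the `alcohol.mean` and `alcohol.median` columns row by row is a quick way to check whether a trend seen in the box plot is driven by a few extreme samples in a small bucket.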

Plot Three

Description Three

The third plot is taken from the multivariate section. Frequency polygons and box plots have already been covered by the previous plots, so I thought of giving the scatter plot a makeover. The legend for quality is set so that a reddish tint corresponds to lower quality ratings and a greenish tint to higher quality ratings. The plot shows a reasonable correlation between sulphates and alcohol. For higher alcohol values (11% and above) combined with sulphates above 0.65 g/dm^3 the quality rating clearly improves, while for lower alcohol content combined with sulphates below 0.6 g/dm^3 the quality rating tends to suffer.

Reflection

Analysis and Observations

The analysis of the two wine data sets began with the selection of the main features. As the main objective of the project was to determine the quality rating of a wine sample from its chemical composition, ‘quality’ and the closely associated variable ‘alcohol’ were chosen as the two main features. The analysis revolved around these two, and some of the other variables in the data sets were treated as supporting features.
In the univariate analysis section of the project, I had assumed that, along with the main features, supporting features like residual sugar and acidic content (fixed, volatile and citric acid) would play an important role in the analysis. I had also thought that features like pH, density and sulphate content would be very useful in determining the quality of wine samples.
Feature selection was carried out for both wine types together. Extensive comparative analysis of the features in the bivariate and multivariate sections revealed some interesting relationships and combinations that might affect the quality of wine samples.
The sequence of the analysis and the successes and difficulties encountered along the way are summarized below:
I started by analyzing the relationship between quality and alcohol. The correlation coefficient and the scatter plot showed a positive relation between the two, but the exact trend was difficult to pinpoint. I took an alternative approach and created box plots showing the mean alcohol level for each quality bucket. The box plots and the mean values showed that, although the correlation was positive, it was not linear across all quality buckets. For both wine types, mean alcohol content dipped across the lower quality ratings, and from a quality rating of 5 onwards there was a positive linear trend.
Evidence for this observation is given by the correlation coefficients below (I have not included these in an R chunk as they are for illustration purposes only). Subsetted by quality, the two wine types are compared for the lower and upper buckets of quality ratings.

Comparison of correlation coefficients (r) for subsets for White wine

with(wqw, cor.test(alcohol, quality)) = 0.44
with(subset(wqw, quality < 5), cor.test(alcohol, quality)) = -0.06
with(subset(wqw, quality > 4), cor.test(alcohol, quality)) = 0.47

Comparison of correlation coefficients (r) for subsets for Red wine

with(wqr, cor.test(alcohol, quality)) = 0.48
with(subset(wqr, quality < 5), cor.test(alcohol, quality)) = 0.12
with(subset(wqr, quality > 4), cor.test(alcohol, quality)) = 0.52
Apart from the dip at the lower quality ratings, there was an overall positive correlation.
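The subset comparison above can be reproduced with a few lines of R. The sketch below runs on a simulated data frame `wq` (an assumed stand-in for `wqw` or `wqr`, with made-up values), so the numbers it prints will not match the ones reported here. Note that `cor.test()` reports Pearson's correlation coefficient r, not r^2.

```r
# Simulated stand-in for one wine data set: alcohol is flat for the lower
# quality buckets and rises from quality 5 upwards. Values are random.
set.seed(7)
wq <- data.frame(quality = rep(3:8, times = c(10, 30, 120, 150, 70, 20)))
wq$alcohol <- 9.5 + 0.5 * pmax(wq$quality - 5, 0) +
  rnorm(nrow(wq), sd = 0.9)

# Pearson's r for the full sample and for the lower and upper quality buckets.
r.all   <- with(wq, cor.test(alcohol, quality))$estimate
r.lower <- with(subset(wq, quality < 5), cor.test(alcohol, quality))$estimate
r.upper <- with(subset(wq, quality > 4), cor.test(alcohol, quality))$estimate

r.table <- round(unname(c(r.all, r.lower, r.upper)), 2)
names(r.table) <- c("all", "quality < 5", "quality > 4")
print(r.table)
```

The lower bucket contains only two quality levels and far fewer samples than the upper one, so its coefficient is much less stable and should be read with caution.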
In addition to alcohol and quality, I also thought pH could help decide the quality of a wine sample. To find the relation between these variables I started by creating scatter plots and calculating the correlation coefficient between alcohol and pH. For both wine types, the correlation was slightly positive. I then drew box plots to check the relation between quality and pH for the white wine samples. Although the correlation seemed poor, the trend was similar to that of alcohol vs. quality (pH dropped for low quality samples and then increased very slightly for the samples with high quality ratings).
I carried this exploration over to the multivariate section, creating density plots for pH with one curve per quality rating bucket. For both wine types, most of the samples were in the pH range 3.0-3.4, and pH did not favor any particular quality rating, as most of the curves were clustered around that range. Finally, I created scatter plots of alcohol vs. pH with the points colored by quality. With lower alcohol content and pH on the higher side, experts tend to give lower ratings. In contrast, higher alcohol content with a controlled pH leads the experts to rank the samples higher. This exercise around pH did not turn out as expected. I had thought that pH would play a very important part in determining quality, but to my surprise the relationship, although positive, was too weak to convince me that pH plays a decisive part.
Along with pH, I thought acidic content would be an important factor too. Taking the analysis a step further, I plotted density curves in the multivariate section for all acid types with respect to each quality bucket. The density plots of fixed and volatile acidity showed similar variations for both wine types, with quality ratings being high for samples whose acidity levels were under control, i.e. around 6-8 g/dm^3 for fixed acidity and 0.3-0.4 g/dm^3 for volatile acidity. The citric acid plot revealed that red wine quality improved with increasing acid content, while the white wine plots remained inconclusive. These plots were followed by scatter diagrams, which showed that acidic content did not play a very important role for white wine compared to red wine. Fixed acidity and citric acid seemed to be the main contributing factors towards quality for red wine.
Sugar is an integral part of any drink, and wine is no exception. Density plots and scatter diagrams revealed that white wine samples had more residual sugar than red wine samples. Quality ratings were mixed, and sugar content did seem to affect the quality of a particular sample for both wine types.
To generalize the analysis, I have come up with the following equations, which roughly relate some of the variables in the data sets.
White wine: Quality ~ Alcohol + pH(to some extent) + Residual Sugar
Red wine: Quality ~ Alcohol + pH(to some extent) + Sulphates
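The two rough relations above translate directly into R model formulas. The sketch below fits them on a simulated data frame `wines` (an assumed stand-in with made-up values; only the column names follow the report), so its coefficients say nothing about real wines. It only shows how each relation could be checked with `lm()`.

```r
# Simulated stand-in for the combined data set: every value is synthetic.
set.seed(3)
n <- 300
wines <- data.frame(
  type           = rep(c("white", "red"), each = n / 2),
  alcohol        = runif(n, 8.5, 14),
  pH             = runif(n, 2.9, 3.6),
  residual.sugar = runif(n, 1, 20),
  sulphates      = runif(n, 0.3, 1.2)
)
wines$quality <- 1 + 0.35 * wines$alcohol + 0.3 * wines$pH +
  rnorm(n, sd = 0.7)

# One linear model per wine type, using the terms from the relations above.
white.fit <- lm(quality ~ alcohol + pH + residual.sugar,
                data = subset(wines, type == "white"))
red.fit   <- lm(quality ~ alcohol + pH + sulphates,
                data = subset(wines, type == "red"))

# Coefficient tables: estimate, standard error, t value and p value per term.
print(round(coef(summary(white.fit)), 3))
print(round(coef(summary(red.fit)), 3))
```

On real data, the size and standard error of the pH coefficient would give a more concrete reading of the "to some extent" qualifier.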

Limitations

One limitation of the data set is that it covers only one type of wine (Vinho Verde). Combining it with data from other wine types and producers could help generalize this analysis and apply it to wines from other regions.

Questions and Future work

Additional variables could have been added to the data set to allow better analysis and prediction. I added two variables, ‘taste’ and ‘taste due to pH’, described in the univariate analysis section, to improve understanding and prediction of the quality of the wine samples. The data used to create these variables was based on assumptions for Riesling wine and was only used to add categorical variables. Based on what I found using these variables, it would have been very useful if the original wine data sets had included a taste variable specific to Vinho Verde wines. A suggestion for future work is therefore to enhance the data set by adding a variable that accounts for taste.

References